I used the same datasets as before:
I chose four datasets: box, on_ice, gar, and xgar. All four contain individual NHL player stats from the 2020–2021 NHL regular season and were downloaded as CSVs from Evolving-Hockey, a site to which I have a membership. “box” and “on_ice” contain data scraped from the NHL play-by-play feed along with some values calculated by a model built by Evolving Hockey. “gar” and “xgar” contain data that is fully calculated by that model.
“box” contains basic box score data for each player at 5v5, such as games played (GP), time on ice (TOI), goals (G), primary assists (A1), secondary assists (A2), and points. The remaining columns contain more granular data such as individual shots for (iSF), individual expected goals (ixG), giveaways (GIVE), takeaways (TAKE), etc. “on_ice” has data for the events that occur while a player is on the ice, such as corsi (shot attempts) for and against, as well as the rates at which they occurred (calculated per 60 minutes played). “gar” stands for Goals Above Replacement, which comes from a regression model built by Evolving Hockey that attempts to isolate a player’s individual impact by valuing their contributions in different aspects of the game in goals above replacement level (which is 0), a concept derived from baseball’s WAR (wins above replacement). The columns include numbers like “EVO_GAR”, goals above replacement from even-strength offense; “EVD_GAR”, GAR from even-strength defense; total goals above replacement (GAR); wins above replacement (WAR); and standings points above replacement (SPAR). “xgar” is the same, but uses expected goals above replacement rather than actual GAR.
This data interests me because it looks beyond just “points” when measuring a player’s impact on the game. It offers a unique perspective on players who may seem unimpactful at first glance at the scoresheet because they lack points, but who are actually contributing in other ways, such as defensively. I believe the top scorers in the league will have both a high number of points and a high EVO_GAR, but it may be surprising to see some MVP candidates lacking in other areas such as defense and CF%.
library(tidyverse)
box <- read_csv("/Users/olivia/Downloads/EH_std_sk_stats_5v5_regular_no_adj_2021-12-10.csv")
on_ice <- read_csv("/Users/olivia/Downloads/EH_std_sk_stats_5v5_regular_adj_2021-12-10.csv")
gar <- read_csv("/Users/olivia/Downloads/EH_gar_sk_stats_regular_2021-12-10.csv")
xgar <- read_csv("/Users/olivia/Downloads/EH_xgar_sk_stats_regular_2021-12-10.csv")
head(gar)
## # A tibble: 6 × 18
## Player Season Team Position GP TOI_All EVO_GAR EVD_GAR PPO_GAR SHD_GAR
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aaron Ekb… 20-21 FLA D 35 878. -1.4 1.2 3 -0.3
## 2 Adam Boqv… 20-21 CHI D 35 595. 2 -1.6 0.2 0
## 3 Adam Broo… 20-21 TOR C 11 117. 0.6 1 0 0
## 4 Adam Erne 20-21 DET L 45 626. 1.8 -2 0.6 0
## 5 Adam Fox 20-21 NYR D 55 1359. 2.4 4.7 2.5 -0.8
## 6 Adam Gaud… 20-21 CHI C 7 85.4 0.8 -1 0.1 0
## # … with 8 more variables: Take_GAR <dbl>, Draw_GAR <dbl>, Off_GAR <dbl>,
## # Def_GAR <dbl>, Pens_GAR <dbl>, GAR <dbl>, WAR <dbl>, SPAR <dbl>
# tidying
full_data <- inner_join(box, on_ice, by=c("Player","Team")) %>%
  inner_join(gar, by=c("Player","Team")) %>%
  inner_join(xgar, by=c("Player","Team"))
clean_data <- full_data %>%
  filter(TOI.x > 200) %>% # keep players with 200+ total minutes
  select(-c(Season.y, Position.y, GP.y, Season.x.x, Position.x.x, GP.x.x,
            TOI_All.x, Season.y.y, Position.y.y, GP.y.y, TOI_All.y,
            Take_GAR.y, Draw_GAR.y, Pens_GAR.y)) %>% # drop duplicate join columns
  rename(Season = Season.x,
Position = Position.x,
GP = GP.x,
"TOI_All" = TOI.x,
"iSh%" = "Sh%.x",
"TOI_5V5" = TOI.y,
"Sh%" = "Sh%.y",
Draw_GAR = Draw_GAR.x,
Take_GAR = Take_GAR.x,
Pens_GAR = Pens_GAR.x)
head(clean_data)
## # A tibble: 6 × 74
## Player Season Team Position GP TOI_All G A1 A2 Points iSF
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aaron Ekbl… 20-21 FLA D 35 603. 4 2 2 8 57
## 2 Adam Erne 20-21 DET L 45 537. 8 0 5 13 61
## 3 Adam Fox 20-21 NYR D 55 954. 2 8 8 18 65
## 4 Adam Henri… 20-21 ANA C 45 568. 10 5 2 17 58
## 5 Adam Lowry 20-21 WPG L 52 621. 9 6 6 21 65
## 6 Adam Pelech 20-21 NYI D 56 1000. 3 4 5 12 72
## # … with 63 more variables: iFF <dbl>, iCF <dbl>, ixG <dbl>, iSh% <dbl>,
## # FSh% <dbl>, xFSh% <dbl>, iBLK <dbl>, GIVE <dbl>, TAKE <dbl>, iHF <dbl>,
## # iHA <dbl>, iPENT2 <dbl>, iPEND2 <dbl>, iPENT5 <dbl>, iPEND5 <dbl>,
## # iPEN± <dbl>, FOW <dbl>, FOL <dbl>, FO± <dbl>, TOI_5V5 <dbl>, GF% <dbl>,
## # SF% <dbl>, FF% <dbl>, CF% <dbl>, xGF% <dbl>, GF/60 <dbl>, GA/60 <dbl>,
## # SF/60 <dbl>, SA/60 <dbl>, FF/60 <dbl>, FA/60 <dbl>, CF/60 <dbl>,
## # CA/60 <dbl>, xGF/60 <dbl>, xGA/60 <dbl>, G±/60 <dbl>, S±/60 <dbl>, …
clean_data <- clean_data %>% mutate("G/60" = G/(TOI_All/60),
"A1/60" = A1/(TOI_All/60),
"A2/60" = A2/(TOI_All/60),
"Points/60" = Points/(TOI_All/60),
"primary_points" = G+A1,
"primary_points/60" = (G+A1)/(TOI_All/60))
# creating binary variable
clean_data <- clean_data %>% mutate(Forward = ifelse(Position == 'D', 0, 1))
head(clean_data)
## # A tibble: 6 × 81
## Player Season Team Position GP TOI_All G A1 A2 Points iSF
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aaron Ekbl… 20-21 FLA D 35 603. 4 2 2 8 57
## 2 Adam Erne 20-21 DET L 45 537. 8 0 5 13 61
## 3 Adam Fox 20-21 NYR D 55 954. 2 8 8 18 65
## 4 Adam Henri… 20-21 ANA C 45 568. 10 5 2 17 58
## 5 Adam Lowry 20-21 WPG L 52 621. 9 6 6 21 65
## 6 Adam Pelech 20-21 NYI D 56 1000. 3 4 5 12 72
## # … with 70 more variables: iFF <dbl>, iCF <dbl>, ixG <dbl>, iSh% <dbl>,
## # FSh% <dbl>, xFSh% <dbl>, iBLK <dbl>, GIVE <dbl>, TAKE <dbl>, iHF <dbl>,
## # iHA <dbl>, iPENT2 <dbl>, iPEND2 <dbl>, iPENT5 <dbl>, iPEND5 <dbl>,
## # iPEN± <dbl>, FOW <dbl>, FOL <dbl>, FO± <dbl>, TOI_5V5 <dbl>, GF% <dbl>,
## # SF% <dbl>, FF% <dbl>, CF% <dbl>, xGF% <dbl>, GF/60 <dbl>, GA/60 <dbl>,
## # SF/60 <dbl>, SA/60 <dbl>, FF/60 <dbl>, FA/60 <dbl>, CF/60 <dbl>,
## # CA/60 <dbl>, xGF/60 <dbl>, xGA/60 <dbl>, G±/60 <dbl>, S±/60 <dbl>, …
After loading the data, I cleaned it by filtering out all players who played fewer than 200 minutes in total (too small a sample to calculate rate stats), removing the duplicate columns created by the joins, and renaming columns to be clearer. I then used mutate to add a binary variable indicating whether a player is a forward (1) or a defenseman (0).
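The per-60 rate stats added with mutate all follow one formula: a count divided by ice time expressed in 60-minute units. A minimal sketch of that formula in Python (illustrative only; the report computes these in R with mutate, and the function name here is my own):

```python
def per60(count, toi_minutes):
    """Rate stat: events per 60 minutes of ice time."""
    return count / (toi_minutes / 60)

# e.g. 10 goals in 600 minutes of ice time is 1.0 goals per 60
```

This is why the filter on total TOI matters: with a small denominator, a couple of lucky goals can produce an extreme per-60 rate.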
nrow(clean_data)
## [1] 422
clean_data %>% group_by(Position) %>% summarize(n = n())
## # A tibble: 4 × 2
## Position n
## <chr> <int>
## 1 C 151
## 2 D 143
## 3 L 70
## 4 R 58
clean_data %>% group_by(Forward) %>% summarize(n = n())
## # A tibble: 2 × 2
## Forward n
## <dbl> <int>
## 1 0 143
## 2 1 279
clean_data %>% summarize(n_distinct(Player))
## # A tibble: 1 × 1
## `n_distinct(Player)`
## <int>
## 1 422
There are 422 total observations in the clean dataset. All the observations are distinct after the data was cleaned and duplicates removed. When grouping by position of players who played over 200 minutes at 5v5, there are 151 centers, 143 defensemen, 70 left wingers, and 58 right wingers. When grouping by the binary variable “Forward”, there are 143 defensemen, and 279 forwards.
library(cluster)
# note: without backticks the slash names resolve to the raw count columns
# (G, Points, primary_points), as the head() output below shows; selecting the
# per-60 rate columns would require e.g. select(`G/60`, `Points/60`, ...)
pam_data_points <- clean_data %>% select(c(G/60,Points/60,primary_points/60,GAR,xGAR))
head(pam_data_points)
## # A tibble: 6 × 5
## G Points primary_points GAR xGAR
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 8 6 2.6 10.5
## 2 8 13 8 -0.6 -0.7
## 3 2 18 10 9.8 9.3
## 4 10 17 15 3.6 0.3
## 5 9 21 15 5.2 2.7
## 6 3 12 7 9.4 2.4
sil_width<-vector() #empty vector to hold mean sil width
for(i in 2:10){
kms <- kmeans(pam_data_points,centers=i) #compute k-means solution for each k
sil <- silhouette(kms$cluster,dist(pam_data_points)) #get sil widths
sil_width[i]<-mean(sil[,3]) #take averages (higher is better)
}
ggplot()+geom_line(aes(x=1:10,y=sil_width))+scale_x_continuous(name="k",breaks=1:10)
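For intuition about what the loop above measures: for each point, a is the mean distance to the rest of its own cluster, b is the mean distance to the nearest other cluster, and the silhouette width is s = (b − a)/max(a, b). A small self-contained sketch in Python (the analysis itself uses cluster::silhouette in R; this function name is my own):

```python
import numpy as np

def silhouette_widths(x, labels):
    """s(i) = (b - a) / max(a, b) for 1-D points x with cluster labels."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    D = np.abs(x[:, None] - x[None, :])          # pairwise distances
    s = np.zeros(len(x))
    for i in range(len(x)):
        same = labels == labels[i]
        same[i] = False                          # exclude the point itself
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == k].mean()
                for k in set(labels.tolist()) if k != labels[i])
        s[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return s

# two tight, well-separated clusters give a mean width near 1
widths = silhouette_widths([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1])
```

Widths near 1 indicate well-separated clusters; values near 0 or below indicate weak or artificial structure, which is the benchmark used to judge the 0.47 below.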
set.seed(322)
hockey_pam <- pam_data_points %>% pam(k=2)
hockey_pam
## Medoids:
## ID G Points primary_points GAR xGAR
## [1,] 281 3 10 7 2.4 0.9
## [2,] 46 10 22 17 5.1 2.0
## Clustering vector:
## [1] 1 1 1 2 2 1 1 2 2 1 2 2 2 2 2 1 1 1 2 1 1 2 2 1 2 1 2 2 1 2 1 2 1 2 1 2 1
## [38] 1 1 2 1 1 1 1 2 2 1 1 2 1 2 1 2 1 2 2 2 2 1 2 1 1 2 2 2 1 1 2 1 1 1 1 2 1
## [75] 1 1 2 1 1 2 1 1 1 2 2 1 1 1 1 2 1 2 1 2 2 1 2 1 1 2
## [ reached getOption("max.print") -- omitted 322 entries ]
## Objective function:
## build swap
## 8.906381 7.687138
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
plot(hockey_pam,which=2)
pamclust<-pam_data_points %>% mutate(cluster=as.factor(hockey_pam$clustering))
#G/60,Points/60,primary_points/60,GAR,xGAR
#(in the aes() calls below, G/60 etc. divide the raw counts by the constant 60,
# which rescales the axes but is not the same as a per-TOI rate stat)
#G/60 vs Points/60
pamclust %>% ggplot(aes(G/60,Points/60,color=cluster)) + geom_point() + scale_x_continuous()
#G/60 vs primary_points/60
pamclust %>% ggplot(aes(G/60,primary_points/60,color=cluster)) + geom_point() + scale_x_continuous()
#G/60 vs GAR
pamclust %>% ggplot(aes(G/60,GAR,color=cluster)) + geom_point() + scale_x_continuous()
#G/60 vs xGAR
pamclust %>% ggplot(aes(G/60,xGAR,color=cluster)) + geom_point() + scale_x_continuous()
#Points/60 vs primary_points/60
pamclust %>% ggplot(aes(Points/60,primary_points/60,color=cluster)) + geom_point() + scale_x_continuous()
#Points/60 vs GAR
pamclust %>% ggplot(aes(Points/60,GAR,color=cluster)) + geom_point() + scale_x_continuous()
#Points/60 vs xGAR
pamclust %>% ggplot(aes(Points/60,xGAR,color=cluster)) + geom_point() + scale_x_continuous()
#primary_points/60 vs GAR
pamclust %>% ggplot(aes(primary_points/60,GAR,color=cluster)) + geom_point() + scale_x_continuous() + scale_y_continuous(breaks=seq(0,30,by=5))
#primary_points/60 vs xGAR
pamclust %>% ggplot(aes(primary_points/60,xGAR,color=cluster)) + geom_point() + scale_x_continuous()
#GAR vs xGAR
pamclust %>% ggplot(aes(GAR,xGAR,color=cluster)) + geom_point() + scale_x_continuous()
clean_data%>%slice(hockey_pam$id.med)
## # A tibble: 2 × 81
## Player Season Team Position GP TOI_All G A1 A2 Points iSF
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Nathan Bas… 20-21 N.J R 41 450. 3 4 3 10 44
## 2 Bo Horvat 20-21 VAN C 56 823. 10 7 5 22 91
## # … with 70 more variables: iFF <dbl>, iCF <dbl>, ixG <dbl>, iSh% <dbl>,
## # FSh% <dbl>, xFSh% <dbl>, iBLK <dbl>, GIVE <dbl>, TAKE <dbl>, iHF <dbl>,
## # iHA <dbl>, iPENT2 <dbl>, iPEND2 <dbl>, iPENT5 <dbl>, iPEND5 <dbl>,
## # iPEN± <dbl>, FOW <dbl>, FOL <dbl>, FO± <dbl>, TOI_5V5 <dbl>, GF% <dbl>,
## # SF% <dbl>, FF% <dbl>, CF% <dbl>, xGF% <dbl>, GF/60 <dbl>, GA/60 <dbl>,
## # SF/60 <dbl>, SA/60 <dbl>, FF/60 <dbl>, FA/60 <dbl>, CF/60 <dbl>,
## # CA/60 <dbl>, xGF/60 <dbl>, xGA/60 <dbl>, G±/60 <dbl>, S±/60 <dbl>, …
pamclust%>%mutate(position=clean_data$Forward)%>%ggplot(aes(primary_points/60,GAR, color=position, shape=cluster))+geom_point()
There were far too many numeric variables to use all of them in the clustering, so I intended to use the following from the dataset: G/60, Points/60, primary_points/60, GAR, and xGAR (note that because the slash-named rate columns were not backticked in select(), the raw counts G, Points, and primary_points were actually selected, as the head() output shows). I then examined average silhouette width and determined that 2 was the optimal number of clusters; the average silhouette width came out to about 0.47. I visualized the clusters by plotting all pairwise combinations of the variables, colored by cluster. After that, I sliced the original dataset to find the final medoids, the players most representative of their clusters: Nathan Bastian and Bo Horvat. Finally, I plotted primary points/60 vs GAR, colored by position (1 is a forward, 0 is a defenseman), with shape indicating cluster.
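PAM’s medoids, unlike k-means centroids, are actual observations: each medoid is the cluster member minimizing total distance to the others, which is why slicing clean_data by hockey_pam$id.med returns real players. A toy one-dimensional sketch in Python (illustrative only; the function name is my own):

```python
def medoid(points):
    """Return the point minimizing total distance to all the others."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

# 100 is an outlier, so the medoid stays in the bulk of the data
```

Because the medoid must be a data point, it is far less sensitive to outliers than a mean-based centroid would be.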
The average silhouette width of 0.47 falls into the “structure is weak and could be artificial” range. When visualizing the clusters, in almost all the pairs the points centered around the same trend line: one cluster consisted mostly of points on the bottom half of the trend line, while the other contained points mostly on the upper half. Visually, the pairs with the most distinct separation between clusters were the ones involving any sort of points (Points/60 vs G/60, or primary_points/60 vs Points/60). This makes sense, as those who tend to score more goals or points are probably skilled enough to have a larger portion of them be primary points. What is interesting is that in the primary points/60 vs GAR pair, players in the cluster that tended to have more primary points/60 also tended to have higher GAR. This distinction stands out when compared to points/60 vs GAR, where the clusters seemed to be characterized more by points/60 than by GAR.
When looking at the plot of primary points/60 vs GAR colored by position and shapes corresponding to cluster, it seems as if most defensemen fall into the first cluster, tending to have lower primary points/60 and relatively lower GAR. Most forwards are in cluster 2, with all the players with the most primary points/60 being forwards in cluster 2.
pca_data <- clean_data %>% select(c(EVD_GAR,PPO_GAR,SHD_GAR,Take_GAR,Draw_GAR,Off_GAR,Def_GAR,Pens_GAR))
pca_data_nums<- pca_data %>% scale
rownames(pca_data_nums)<-clean_data$Player
head(pca_data_nums)
## EVD_GAR PPO_GAR SHD_GAR Take_GAR Draw_GAR
## Aaron Ekblad 0.4502207 1.83726191 -0.4290565 -0.9526899 0.46232209
## Adam Erne -1.0781304 0.02133641 -0.0284772 -0.3706629 -1.66294941
## Adam Fox 2.1218548 1.45894409 -1.0966887 1.8119383 -0.81284081
## Adam Henrique -0.9826085 -0.35698140 -0.5625829 1.3754181 -1.37957987
## Adam Lowry 0.4024598 -2.24857047 2.6420515 0.9388978 0.03726779
## Adam Pelech 3.6024450 -0.50830853 -2.1649001 0.7933911 -0.10441697
## Off_GAR Def_GAR Pens_GAR
## Aaron Ekblad -0.30255677 0.2709823 -0.34646615
## Adam Erne -0.08033464 -0.9928446 -1.51783971
## Adam Fox 0.53694903 1.5348093 0.61193039
## Adam Henrique 0.63571442 -1.0800051 -0.02700064
## Adam Lowry -0.42601350 1.2297476 0.71841890
## Adam Pelech -0.10502599 2.5807350 0.50544189
gar_pca <- princomp(pca_data_nums, cor=TRUE) # princomp's argument is cor, not corr (which would be silently ignored); the data is already scaled, so results are essentially unchanged
summary(gar_pca, loadings=T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.4575422 1.4497503 1.1407057 1.0590157 0.9437020
## Proportion of Variance 0.2661844 0.2633460 0.1630375 0.1405223 0.1115861
## Cumulative Proportion 0.2661844 0.5295305 0.6925680 0.8330903 0.9446764
## Comp.6 Comp.7 Comp.8
## Standard deviation 0.66340766 0.0355995060 1.278126e-02
## Proportion of Variance 0.05514439 0.0001587919 2.046859e-05
## Cumulative Proportion 0.99982074 0.9999795314 1.000000e+00
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## EVD_GAR 0.490 0.382 0.166 0.191 0.338 0.655
## PPO_GAR -0.291 0.196 0.637 -0.684
## SHD_GAR 0.284 0.167 -0.354 -0.834 -0.124 0.234
## Take_GAR -0.240 0.342 -0.219 0.660 -0.294 0.503
## Draw_GAR -0.171 0.398 -0.262 -0.627 0.283 0.520
## Off_GAR -0.351 0.220 0.547 -0.137 0.711
## Def_GAR 0.539 0.404 0.157 -0.718
## Pens_GAR -0.306 0.548 -0.356 -0.690
#scree plot
eigval<-gar_pca$sdev^2 #square to convert SDs to eigenvalues
varprop=round(eigval/sum(eigval), 2) #proportion of var explained by each PC
ggplot() + geom_bar(aes(y=varprop, x=1:8), stat="identity") + xlab("") +
geom_text(aes(x=1:8, y=varprop, label=round(varprop, 2)), vjust=1, col="white", size=5) +
scale_y_continuous(breaks=seq(0, .6, .2), labels = scales::percent) +
scale_x_continuous(breaks=1:8) # 8 components
round(cumsum(eigval)/sum(eigval), 2)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## 0.27 0.53 0.69 0.83 0.94 1.00 1.00 1.00
PC1, PC2, PC3, PC4 account for 83% of the total variance, so I will keep PC1 - PC4.
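The retention rule comes straight from the eigenvalues: each PC’s share of the variance is its eigenvalue over the eigenvalue sum, and the cumulative sum of those shares is what the 83% figure refers to. A self-contained numeric sketch (Python/NumPy here purely for illustration; the report uses princomp in R):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 2 * X[:, 0]                              # make two columns correlated

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # like scale() in R
eigval = np.linalg.eigvalsh(np.cov(Xs.T))[::-1]     # eigenvalues, descending
varprop = eigval / eigval.sum()                     # variance share per PC
cumprop = np.cumsum(varprop)                        # cf. round(cumsum(eigval)/sum(eigval), 2)
```

Because the inputs are standardized, the eigenvalue sum equals the number of variables, and the shares always add to 1; one then keeps the leading PCs up to a chosen cumulative threshold (80% here).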
#scores in dataframe
hockeypcadf<-data.frame(Player=clean_data$Player, PC1=gar_pca$scores[, 1],PC2=gar_pca$scores[, 2], PC3=gar_pca$scores[, 3], PC4=gar_pca$scores[, 4])
#PC1 and PC2
ggplot(hockeypcadf, aes(PC1, PC2)) + geom_point()
#PC1 and PC3
ggplot(hockeypcadf, aes(PC1, PC3)) + geom_point()
#PC1 and PC4
ggplot(hockeypcadf, aes(PC1, PC4)) + geom_point()
#PC2 and PC3
ggplot(hockeypcadf, aes(PC2, PC3)) + geom_point()
#PC2 and PC4
ggplot(hockeypcadf, aes(PC2, PC4)) + geom_point()
#PC3 and PC4
ggplot(hockeypcadf, aes(PC3, PC4)) + geom_point()
cor(gar_pca$scores) %>% round(10)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Comp.1 1 0 0 0 0 0 0 0
## Comp.2 0 1 0 0 0 0 0 0
## Comp.3 0 0 1 0 0 0 0 0
## Comp.4 0 0 0 1 0 0 0 0
## Comp.5 0 0 0 0 1 0 0 0
## Comp.6 0 0 0 0 0 1 0 0
## Comp.7 0 0 0 0 0 0 1 0
## Comp.8 0 0 0 0 0 0 0 1
I chose to retain PC1, PC2, PC3, and PC4 because they account for >80% of the total variance. PC2 is actually the general ability axis, because all of its loadings share the same sign: the higher a player scores on PC2, the better they are overall in terms of GAR (the variables are the components that make up GAR). PC1 is a defensive axis: EVD/SHD/Def vs. PPO/Take/Draw/Off/Pens. EVD, SHD, and Def are even-strength defense, shorthanded defense, and overall defense, so the higher a player scores on this axis, the better their defensive abilities (positive signs) and the worse their other components (negative signs). PC3 is mostly an offensive axis: PPO and Off have the highest magnitudes and positive signs (with EVD and Def at much smaller magnitudes), so a player who scores well on PC3 is better at offensive components like power-play offense (PPO) and overall offense (Off), but worse in the penalties-taken (Take), penalties-drawn (Draw), and overall penalty (Pens) components. PC4 is EVD/Take vs. SHD/Draw: those who score highly on PC4 are better in even-strength defense and penalties taken, and worse in shorthanded defense and penalties drawn.
There appears to be no correlation between the components, which I confirmed with both the correlation matrix and the pairwise plots of the PCs against each other.
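That the PC scores come out exactly uncorrelated is a property of the construction rather than a finding about this data: projecting onto the eigenvectors of the correlation matrix diagonalizes it. A quick numerical check of that fact, sketched in Python (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
X[:, 2] += X[:, 0]                             # correlated input columns

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
w, V = np.linalg.eigh(np.cov(Xs.T))            # eigenvectors as columns of V
scores = Xs @ V                                # PC scores
C = np.corrcoef(scores.T)                      # should be the identity matrix
off_diag = np.abs(C - np.eye(3)).max()         # numerically zero off-diagonals
```

So the all-zeros correlation matrix above is a sanity check on the PCA computation, not evidence about the GAR components themselves.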
num_data <- clean_data%>%select(-c(Player,Season,Team,Position))
head(num_data)
## # A tibble: 6 × 77
## GP TOI_All G A1 A2 Points iSF iFF iCF ixG `iSh%` `FSh%`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 35 603. 4 2 2 8 57 79 113 2.59 7.02 5.06
## 2 45 537. 8 0 5 13 61 78 92 6.34 13.1 10.3
## 3 55 954. 2 8 8 18 65 92 120 3.48 3.08 2.17
## 4 45 568. 10 5 2 17 58 78 100 5.21 17.2 12.8
## 5 52 621. 9 6 6 21 65 85 112 6.8 13.8 10.6
## 6 56 1000. 3 4 5 12 72 108 162 3.42 4.17 2.78
## # … with 65 more variables: xFSh% <dbl>, iBLK <dbl>, GIVE <dbl>, TAKE <dbl>,
## # iHF <dbl>, iHA <dbl>, iPENT2 <dbl>, iPEND2 <dbl>, iPENT5 <dbl>,
## # iPEND5 <dbl>, iPEN± <dbl>, FOW <dbl>, FOL <dbl>, FO± <dbl>, TOI_5V5 <dbl>,
## # GF% <dbl>, SF% <dbl>, FF% <dbl>, CF% <dbl>, xGF% <dbl>, GF/60 <dbl>,
## # GA/60 <dbl>, SF/60 <dbl>, SA/60 <dbl>, FF/60 <dbl>, FA/60 <dbl>,
## # CF/60 <dbl>, CA/60 <dbl>, xGF/60 <dbl>, xGA/60 <dbl>, G±/60 <dbl>,
## # S±/60 <dbl>, F±/60 <dbl>, C±/60 <dbl>, xG±/60 <dbl>, Sh% <dbl>, …
#picked 11 variables because I have a lot of numeric ones
fit <- glm(Forward ~ G + Points + EVO_GAR + EVD_GAR + SHD_GAR + Take_GAR + Draw_GAR + Off_GAR + Def_GAR + Pens_GAR + GAR, data=num_data, family="binomial")
score <- predict(fit, type="response")
class_diag(score,num_data$Forward,positive=1) # class_diag() is a helper defined outside this document; it reports accuracy, sensitivity, specificity, PPV, F1, balanced accuracy, and AUC
## acc sens spec ppv f1 ba auc
## Metrics 0.8294 0.8781 0.7343 0.8657 0.8719 0.8062 0.9058
set.seed(1234)
k=10 #choose number of folds
data<-num_data[sample(nrow(num_data)),] #randomly order rows
folds<-cut(seq(1:nrow(num_data)),breaks=k,labels=F) #create 10 folds
diags<-NULL
for(i in 1:k){
## Create training and test sets
train<-data[folds!=i,]
test<-data[folds==i,]
truth<-test$Forward
## Train model on training set
fit<-glm(Forward~G + Points + EVO_GAR + EVD_GAR + SHD_GAR + Take_GAR + Draw_GAR + Off_GAR + Def_GAR + Pens_GAR + GAR,data=train,family="binomial")
probs<-predict(fit,newdata = test,type="response")
## Test model on test set (save all k results)
diags<-rbind(diags,class_diag(probs,truth, positive=1))
}
summarize_all(diags,mean)
## acc sens spec ppv f1 ba auc
## 1 0.80305 0.86554 0.69059 0.84521 0.85218 0.77807 0.89205
The AUC for the logistic regression model fit to the full dataset was 0.9058, which is in the “great” range (0.9–1.0) in terms of how well the model discriminates. Under 10-fold cross-validation the AUC fell slightly, to about 0.892, dropping from “great” to “good”. This decrease also suggests some overfitting.
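AUC has a concrete interpretation worth keeping in mind when reading these tables: it is the probability that a randomly chosen forward receives a higher predicted score than a randomly chosen defenseman, with ties counting half. A minimal sketch in Python (class_diag computes this in R; the function name here is my own):

```python
def auc(scores, truth):
    """P(random positive outranks random negative); ties count 1/2."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# perfect separation gives 1.0; a useless model hovers around 0.5
```

On this reading, an AUC of 0.9058 means a randomly picked forward outranks a randomly picked defenseman about 91% of the time.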
library(caret)
# note: Forward is numeric (0/1) here, so caret fits a regression tree
# ($modelType in the output below is "Regression"); converting Forward to a
# factor would give a true classification tree
fit <- train(Forward ~ G + Points + EVO_GAR + EVD_GAR + SHD_GAR + Take_GAR + Draw_GAR + Off_GAR + Def_GAR + Pens_GAR + GAR , data=num_data, method="rpart")
library(rpart.plot)
rpart.plot(fit$finalModel,digits=4)
set.seed(1234)
cv <- trainControl(method="cv", number = 10, classProbs = T, savePredictions = T)
fit <- train(Forward ~ G + Points + EVO_GAR + EVD_GAR + SHD_GAR + Take_GAR + Draw_GAR + Off_GAR + Def_GAR + Pens_GAR + GAR, data=num_data, trControl=cv, method="rpart")
head(fit)
## $method
## [1] "rpart"
##
## $modelInfo
## ... (caret's internal "rpart" model definition functions — grid, fit,
## predict, prob, varImp, etc. — omitted for brevity) ...
##
## $modelType
## [1] "Regression"
##
## $results
## cp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.08300762 0.4013210 0.3109312 0.2990674 0.04332833 0.15957703 0.04961674
## 2 0.09307666 0.4086603 0.2899958 0.3108957 0.04001578 0.16130592 0.04460322
## 3 0.28186018 0.4728159 0.1069564 0.4221004 0.02912230 0.06102981 0.05173063
##
## $pred
## pred obs rowIndex cp Resample
## 1 0.9393939 1 4 0.08300762 Fold01
## 2 0.9393939 1 15 0.08300762 Fold01
## 3 0.9393939 1 30 0.08300762 Fold01
## 4 0.9393939 1 36 0.08300762 Fold01
## 5 0.1750000 0 41 0.08300762 Fold01
## 6 0.9393939 1 47 0.08300762 Fold01
## 7 0.9393939 1 53 0.08300762 Fold01
## 8 0.9393939 1 57 0.08300762 Fold01
## 9 0.7007874 0 62 0.08300762 Fold01
## 10 0.7007874 1 75 0.08300762 Fold01
## 11 0.9393939 1 101 0.08300762 Fold01
## 12 0.9393939 0 108 0.08300762 Fold01
## 13 0.9393939 1 110 0.08300762 Fold01
## 14 0.1750000 0 113 0.08300762 Fold01
## 15 0.7007874 1 122 0.08300762 Fold01
## 16 0.9393939 1 133 0.08300762 Fold01
## 17 0.7007874 0 139 0.08300762 Fold01
## 18 0.1750000 0 140 0.08300762 Fold01
## 19 0.7007874 0 143 0.08300762 Fold01
## 20 0.7007874 0 146 0.08300762 Fold01
## [ reached 'max' / getOption("max.print") -- omitted 1246 rows ]
##
## $bestTune
## cp
## 1 0.08300762
class_diag(fit$pred$pred, fit$pred$obs, positive=1)
## acc sens spec ppv f1 ba auc
## Metrics 0.6809 0.828 0.3939 0.7272 0.7743 0.6109 0.7332
Interpretation of the classification tree: the tree splits only on G (goals). Initially, 66.11% of the players are forwards. The first split is at 6 goals: players with 6 or more goals are forwards with probability 94.15%. Players with fewer than 6 goals are then split at 2 goals: those with fewer than 2 goals are forwards with probability 11.29%, while those with 2 to 5 goals are forwards with probability 55.23%.
When running cross-validation on this model, the AUC was 0.7332, well below the logistic regression’s, so the tree performs poorly by comparison.
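Since the fitted tree reduces to a pair of thresholds on goals, the whole classifier can be written as a three-line lookup. A sketch encoding the leaf probabilities reported above (Python for illustration; the numbers are taken from the rpart plot described in the text, and the function name is my own):

```python
def p_forward(goals):
    """Leaf probabilities from the fitted tree: splits at G >= 6 and G >= 2."""
    if goals >= 6:
        return 0.9415   # high scorers are almost all forwards
    if goals >= 2:
        return 0.5523   # middle band is close to a coin flip
    return 0.1129       # low scorers are mostly defensemen
```

The wide middle band is the tree’s weakness relative to the logistic regression: every player with 2–5 goals gets the same near-coin-flip probability.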
fit<-lm(GAR~ G + Points,data=num_data)
yhat<-predict(fit)
mean((num_data$GAR-yhat)^2)
## [1] 11.83649
set.seed(1234)
k=10 #choose number of folds
data<-num_data[sample(nrow(num_data)),] #randomly order rows
folds<-cut(seq(1:nrow(num_data)),breaks=k,labels=F) #create folds
diags<-NULL
for(i in 1:k){
train<-data[folds!=i,]
test<-data[folds==i,]
## Fit linear regression model to training set
fit<-lm(GAR~G + Points,data=train)
## Get predictions/y-hats on test set (fold i)
yhat<-predict(fit,newdata=test)
## Compute prediction error (MSE) for fold i
diags<-c(diags,mean((test$GAR-yhat)^2)) # append fold i's MSE (assigning without c() would keep only the last fold)
}
mean(diags)
## [1] 16.4382
For the linear regression of GAR on G and Points, the MSE over the entire dataset was 11.84, which is high, meaning the model does not fit well. When cross-validation was run on the model, the MSE rose to 16.44, a sign of overfitting.
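The cross-validated MSE is just the average of the per-fold test errors, and the key implementation detail is to accumulate each fold’s MSE rather than overwrite it. A compact sketch of the same loop in Python/NumPy on simulated data (the report’s version is in R; everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=60)
y = 2 * x + rng.normal(scale=0.5, size=60)     # linear signal plus noise

idx = rng.permutation(60)                      # randomly order rows
k = 5
fold_mses = []
for f in range(k):
    test = idx[f::k]                           # every k-th index forms fold f
    train = np.setdiff1d(idx, test)
    slope, intercept = np.polyfit(x[train], y[train], 1)  # fit on training folds
    yhat = intercept + slope * x[test]
    fold_mses.append(np.mean((y[test] - yhat) ** 2))      # append, don't overwrite
cv_mse = float(np.mean(fold_mses))             # average test error across folds
```

Averaging across all k folds, instead of reporting only the last fold, is what makes the estimate stable.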
library(reticulate)
#py_run_string('favplayer = "Auston Matthews"') # define favplayer on the Python side
#cat("My favorite player is", py$favplayer)     # cat() concatenates multiple arguments; print() would not
I don’t know why it’s not working (the chunk keeps freezing with no output), so I commented it out, but the code is supposed to print “My favorite player is Auston Matthews” by referring to the favplayer variable. favplayer is a Python variable, which I access from R via py$favplayer and print in an R chunk. One likely issue, separate from the freeze, is that R’s print() does not concatenate multiple arguments the way Python’s print() does; cat() or paste0() is needed instead.